The dynamic approximate membership problem asks to represent a set S of size n, whose elements
are provided in an on-line fashion, supporting membership queries without false negatives
and with a false positive rate at most ε. That is, the membership algorithm must be correct on
each x ∈ S, and may err with probability at most ε on each x ∉ S.

We study a well-motivated, yet insufficiently explored, variant of this problem where the size
n of the set is not known in advance. Existing optimal approximate membership data structures
require that the size is known in advance, but in many practical scenarios this is not a realistic
assumption. Moreover, even if the eventual size n of the set is known in advance, it is desirable
to have the smallest possible space usage also when the current number of inserted elements is
smaller than n. Our contribution consists of the following results:

• We show a super-linear gap between the space complexity when the size is known in advance
and the space complexity when the size is not known in advance. When the size is known
in advance, it is well-known that Θ(n log(1/ε)) bits of space are necessary and sufficient
(Bloom '70, Carter et al. '78). However, when the size is not known in advance, we prove
that at least (1 − o(1)) n log(1/ε) + Ω(n log log n) bits of space must be used. In particular,
the average number of bits per element must depend on the size of the set.

• We show that our space lower bound is tight, and can even be matched by a highly efficient
data structure. We present a data structure that uses (1 + o(1)) n log(1/ε) + O(n log log n)
bits of space for approximating any set of any size n, without having to know n in advance.
Our data structure supports membership queries in constant time in the worst case with
high probability, and supports insertions in expected amortized constant time. Moreover,
it can be "de-amortized" to support also insertions in constant time in the worst case with
high probability by only increasing its space usage to O(n log(1/ε) + n log log n) bits.
1 Introduction

Dictionaries play a fundamental role in the design and analysis of algorithms, enabling representation 
of any given set S while supporting membership queries. For sets of size n that are taken from a 
universe U of size u, any dictionary must clearly use at least log (u choose n) = n log(u/n) + Θ(n) bits
of space^1. Whereas dictionaries offer exact representations of sets, in many realistic scenarios it is
desirable to trade exact representations for approximate ones in order to reduce space consumption.
This was observed already by Bloom [Blo70], whose classical design of a Bloom Filter provides a
simple and practical alternative to dictionaries. 

Bloom's data structure solves the problem known these days as the approximate membership 
problem. This problem asks to represent any given set S of size n while supporting membership 
queries without false negatives, and with a false positive rate at most ε. That is, the membership
algorithm must be correct on any x ∈ S, and may err with probability at most ε on any x ∈ U \ S
(where the probability is taken over the randomness used by the data structure). The approximate 
membership problem can be considered in the static setting where the set is specified in advance, or 
in the dynamic setting where the elements of the set are specified one by one in an on-line fashion. 

Bloom's data structure uses only log e · n log(1/ε) bits of space (and solves the problem even
in the dynamic setting), and Carter et al. [CFG+78] proved that this is essentially optimal: Any
approximate membership data structure must use at least n log(1/ε) bits of space, even in the static
setting. Over the years a long line of research has shown how to design approximate membership
data structures that are essentially optimal in both their space utilization and the efficiency of their
operations. We refer the reader to the survey of Broder and Mitzenmacher [BM03] for various
applications of approximate membership data structures, and to Section 1.2 for an overview of
the known results.
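
As a rough numerical illustration of the two bounds just mentioned, the following sketch compares the usual sizing of an optimally tuned Bloom filter with the n log(1/ε) lower bound. The function names are hypothetical, and the sizing rule is the standard textbook one rather than anything specific to this paper.

```python
import math


def bloom_parameters(n, eps):
    """Textbook sizing of a Bloom filter for n keys and false positive rate eps.

    Returns (number of bits, number of hash functions).
    """
    bits = math.ceil(n * math.log2(math.e) * math.log2(1 / eps))
    hashes = max(1, round(math.log2(1 / eps)))
    return bits, hashes


def lower_bound_bits(n, eps):
    """The n*log2(1/eps) space lower bound for approximate membership."""
    return n * math.log2(1 / eps)


if __name__ == "__main__":
    n, eps = 1000, 1 / 16
    bits, hashes = bloom_parameters(n, eps)
    ratio = bits / lower_bound_bits(n, eps)
    print(bits, hashes, ratio)  # the ratio is close to log2(e), about 1.44
```

The ratio between the two quantities stays near log e regardless of n, which is the "average number of bits per element is independent of n" behaviour that the known-size setting enjoys.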

Approximating sets of unknown sizes. The vast majority of existing approximate membership
data structures require that the size n of the set S to be approximated be known in
advance. In many practical scenarios, however, it is unrealistic to assume that the size is known in
advance [GWC+06]. Moreover, even if the eventual size n of the set is known in advance, it is
desirable to have the smallest possible space usage also when the current number of inserted elements
is smaller than n.

In this paper we study the well-motivated, yet insufficiently explored, variant of the dynamic
approximate membership problem where the size n of the set is not known in advance. We refer to
this problem as approximate membership for sets of unknown sizes. This problem is parameterized by
u ∈ ℕ and 0 < ε < 1, and asks to design a data structure offering three algorithms: Initialize, Insert,
and Membership. Upon initialization via the Initialize algorithm, the data structure is presented 
with a sequence of elements that are taken from a universe U of size u. The elements are specified 
in an on-line fashion, and each element is processed using the Insert algorithm that updates the 
internal state of the data structure. The Membership algorithm should satisfy the following two 
requirements: 

• No false negatives: For any n ≤ u, S ⊆ U of size n, and x ∈ S, the Membership algorithm
always outputs Yes on x after the elements of S are processed by the Insert algorithm.

• False positive rate at most ε: For any n ≤ u, S ⊆ U of size n, and x ∉ S, the Membership
algorithm outputs Yes on x with probability at most ε after the elements of S are processed by
the Insert algorithm (where the probability is taken over the randomness of the data structure).



^1 Throughout this paper all logarithms are to the base 2.



Gradually-increasing space consumption. For the approximate membership problem when 
the size n of the set is known in advance, it is well-known that Θ(n log(1/ε)) bits of space suffice
even in the dynamic setting, and are essential even in the static setting (recall that a Bloom filter
uses O(n log(1/ε)) bits which is asymptotically optimal). That is, the average number of bits for
representing each element is Θ(log(1/ε)) which is independent of the size of the set.

In this light, a natural question is whether this is also the case when the size n is not known 
in advance, and the data structure is required to work for sets of any size n ≤ u. That is, we
ask the following question: Is there a dynamic approximate membership data structure that uses
space O(n log(1/ε)) for representing any set S of any size n ≤ u? Somewhat surprisingly, this
question was so far addressed only from a practical perspective, and has not been investigated from
a foundational perspective. Moreover, the data structures we could find in the literature [ABP+07,
HKL08, GWC+06, GWC+10, WJZ+11, WJZ+13] use space Ω(n log n) bits (and query time Ω(log n)
or Ω(log(1/ε))). These solutions are somewhat naive from an algorithmic point of view, and provide
poor asymptotic bounds.

1.1 Our Contributions 

We present a lower bound and matching upper bounds on the space complexity of approximate 
membership for sets of unknown sizes. Our lower bound shows that if the size n of
the sets to be approximated is not known in advance, then it is not possible to use an average
of O(log(1/ε)) bits per element as in the standard case. Specifically, we show a super-linear gap
between the space complexity when n is known in advance and the space complexity when n is not 
known in advance. We prove the following theorem: 

Theorem 1.1 (Lower bound - informal). Any data structure for approximate membership for sets 
of unknown sizes with false positive rate ε must use space (1 − o(1)) n log(1/ε) + Ω(n log log n) bits
after some number of insertions n ≥ u^δ, for any arbitrarily small constant 0 < δ < 1.

In particular, Theorem 1.1 states that the average number of bits per element must be at least 
(1 − o(1)) log(1/ε) + Ω(log log n) at some point in time while processing a not-too-short sequence.
We emphasize that in many practical scenarios (see [BM03]) a typical false positive rate is a
not-too-small constant (e.g., ε = 1/10). For such a range of parameters our lower bound states that the
average number of bits per element must be Ω(log log n) as opposed to constant.

We then show that our lower bound is asymptotically tight by presenting two constructions 
with a space usage that matches our lower bound up to additive lower order terms. We prove the 
following theorem: 

Theorem 1.2 (Upper bound - informal). There exists a data structure for approximate membership 
for sets of unknown sizes with false positive rate ε that uses space (1 + o(1)) n log(1/ε) + O(n log log n)
bits for any sequence of n ≥ u^δ insertions, for any arbitrarily small constant 0 < δ < 1.

Our first construction (which can be viewed as a warm-up) is quite natural and uses a sequence 
of dynamic approximate membership data structures of geometrically-increasing sizes. It supports 
insertions in expected amortized constant time, but membership queries are supported in time 
O(logn). Our second construction is significantly more subtle, showing that in fact our space 
lower bound can be matched by a highly efficient data structure supporting membership queries 
in constant time in the worst case with high probability (while still enjoying expected amortized 
constant insertion time as in our first construction). Moreover, we show that it can be "de-amortized" 
to support also insertions in constant time in the worst case with high probability by increasing its 
space usage from (1 + o(1)) n log(1/ε) + O(n log log n) bits to O(n log(1/ε) + n log log n) bits (with a
rather small leading constant). We refer the reader to Section 1.3 for a high-level overview of the
main ideas underlying our lower bound and constructions. 

Finally, we note that in both our lower bound and constructions we consider approximate
representation of sets whose size n is polynomially related to the universe size u (i.e., n ≥ u^δ for any
arbitrarily small constant 0 < δ < 1). This is rather standard for exact or approximate
representation of sets as one can always apply a universe reduction via simple universal hashing given any
polynomial upper bound on the number of elements.

1.2 Related work 

Bloom filters. The elegant data structure proposed by Bloom [Blo70] naturally allows dynamic 
insertions, but uses space that is a factor log e ≈ 1.44 larger than the information theoretic lower
bound of n log(1/ε) bits [CFG+78]. Another thing to notice is that Bloom filters do not allow
deletions from S, as setting any bit to 0 could result in false negatives.

Deletion queries can be supported by using counting Bloom filters [FCA+00], at the cost of 
an Ω(log log n) factor increase in space usage. Deletions are supported in the sense that the data
structure will work correctly if no attempt is made to delete a false positive, but by definition it is
not possible to prevent such deletions. Cohen and Matias [CM03] present a way of decreasing the
space overhead to O(n) bits, and generalize approximate membership to approximate multiplicity
in a multiset. 

Dictionary-based approximate membership. Already in 1978, Carter et al. [CFG+78] had
presented a technique that would lead to a similar result. They observed that maintaining the
multiset h(S), where h : [u] → [n/ε] is a universal hash function [CW79], yields a solution with
space n log(1/ε) + O(n) bits if the set h(S) is stored in space close to the information theoretic
bound of log (n/ε choose n) bits. If deletions are not needed it suffices to store the set of distinct hash
values h(S). This dynamic set can be stored succinctly with all operations taking O(1) time with
high probability [ANS10]. Dynamic multisets, and thus deletions, can be supported via a reduction
to the standard membership problem [PPR05], at the cost of amortized expected update bounds.
A more practical alternative was explored in [BMP+06].

Separation of on-line and off-line space requirements. Dietzfelbinger and Pagh [DP08]
showed how to approach the n log(1/ε) space lower bound up to an o(n) term using query time
ω(log(1/ε)), in the case where ε is an integer power of 2. Independently, Porat [Por09] achieved the
same result with constant query time. Recently, Belazzougui and Venturini [BV13] showed how to
eliminate the restriction on ε, still maintaining constant query time.

Lovett and Porat [LP10] showed that these results for the static case do not extend to the
situation where dynamic updates are allowed: An overhead of Ω(n/ log(1/ε)) bits is required. The
lower bound holds even if there are no queries before the end of the insertion sequence. In other 
words, this result implies that to build an approximate membership data structure for a key set 
given as a data stream, it does not suffice to use space close to the static size lower bound. 

Dynamic space usage. The setting where space must depend on the current size of the set is
more demanding from an upper bound perspective. In fact, the techniques for this problem that
we could find in the literature [ABP+07, HKL08, GWC+06, GWC+10, WJZ+11, WJZ+13] lead to
Ω(log n) or Ω(log(1/ε)) query time, and a space overhead of Ω(n log n) bits. These data structures
share the idea of working with a sequence of approximate membership data structures, all of which
are queried. If geometrically increasing capacity is chosen this means that there will be Θ(log n) such
data structures (of course, if we have some initial capacity n_0 this number decreases to O(log(n/n_0)),
which might be fine in practical situations, but it is not asymptotically optimal). A consequence
of working with a series of approximations is that the sum of the corresponding false positive rates
ε_1, ε_2, ... must converge to ε. For example, in [ABP+07] it is suggested to achieve this by letting ε_i
decrease geometrically with i. This implies that ε_i = n^{−Ω(1)}, yielding Ω(n log n) space usage.
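
To see where the n log n term comes from, the following worked bound assumes, purely for illustration, that the i-th structure holds 2^i elements, that ε_i = ε/2^i, and that each structure spends at least log(1/ε_i) bits per element (the lower bound of [CFG+78]). These concrete parameters are assumptions made here and are not taken from the cited works.

```latex
% Assumed parameters: k structures, the i-th holds 2^i elements, eps_i = eps / 2^i.
% Total number of elements: n = 2 + 4 + ... + 2^k = 2^{k+1} - 2,
% hence 2^k > n/2 and k > \log n - 1.
\[
  \sum_{i=1}^{k} 2^{i} \log\frac{1}{\varepsilon_i}
  \;=\; \sum_{i=1}^{k} 2^{i}\Bigl(\log\frac{1}{\varepsilon} + i\Bigr)
  \;\ge\; 2^{k}\Bigl(\log\frac{1}{\varepsilon} + k\Bigr)
  \;>\; \frac{n}{2}\Bigl(\log\frac{1}{\varepsilon} + \log n - 1\Bigr)
  \;=\; \Omega(n \log n).
\]
```

The largest structure alone already pays about log n extra bits for each of its roughly n/2 elements, because its false positive rate has shrunk to about ε/n.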

Dynamic perfect hashing and retrieval. An approach to approximate membership in the static 
case is to store a perfect hash function that maps keys injectively to {1, . . . ,n}, and then store a 
signature of log(l/e) bits for each key in an array, placed according to the perfect hash function. 
More generally, a dynamic data structure for retrieval (e.g. the Bloomier filters of [CKR+04]) allows
us to make a dynamic approximate membership data structure. As shown in [MPP05] both these
problems require space Ω(n log log n) in the on-line setting. However, the upper bounds have a fixed
space usage (up to constant factors), and hence do not allow the kind of result we obtain. 

1.3 Overview of Our Contributions 

In this section we provide a high-level overview of the main ideas underlying our lower bound and 
constructions. 

The lower bound: From Approximate membership to compression. When dealing with 
dictionaries (i.e., with exact membership as opposed to approximate membership), it is quite simple 
to deal with the fact that the size of the set S to be stored is not known in advance. Specifically, 
at any point in time a dictionary stores a description of the set S of elements that were inserted 
so far. Then, upon inserting a new element x ∈ U \ S this description can simply be updated to
that of S' = S ∪ {x}. The dictionary can describe S and then S' using the minimal,
information-theoretic, number of bits. Moreover, there are even time-efficient solutions that gradually increase
the size of the dictionary while offering constant-time operations in the worst case together with
an asymptotically-optimal space consumption at any point in time (see, for example, the work of
Dietzfelbinger and auf der Heide [DMadHOO]).

When dealing with approximate membership, however, it seems significantly more challenging 
when the size of the set S to be approximated is not known in advance. For simplifying the following 
discussion we consider here deterministic approximate membership data structures, but note that 
the exact same ideas carry over to randomized ones. Specifically, for using asymptotically optimal 
space, an approximate membership data structure cannot afford to store an exact description of the 
set S of elements that were inserted so far. Instead, any particular state of the data structure may
be used for many sets other than S, and the result of the Membership algorithm must be Yes on any
element that belongs to the union Ŝ of these sets. Upon inserting a new element x ∈ U \ S, the
data structure has to update the description of the current superset Ŝ to the description of some
superset Ŝ' of S' = S ∪ {x}. Note, however, that the data structure does not have access to the
set S, but only to the approximation Ŝ containing S. Therefore, it must hold that Ŝ ⊆ Ŝ' as any
element of Ŝ might have been inserted, and false negatives are not allowed.

The main observation underlying our lower bound proof is that not only the new superset Ŝ' has
to be larger than the old one Ŝ (as Ŝ ⊆ Ŝ'), but it actually has to be significantly larger. That is,
upon the insertion of an element, the data structure must update its internal state by adding many
elements to the currently stored superset. This is in contrast to the setting of exact membership
discussed above, where upon the insertion of an element, a dictionary can update its internal state
by adding only the newly added element to the currently stored set. We formalize this observation
via a compression argument showing that if Ŝ' \ Ŝ is rather small, then we can "compress" the set S'
below the information-theoretic lower bound. We note that this argument takes into account only
space utilization, and does not need to make any assumptions on the efficiency of the data structure
in terms of the time complexity of its Insert and Membership algorithms. We refer the reader to
Section 3 for the proof of our lower bound.

Construction 1 (warm-up): Geometrically-increasing data structures. Our first construction
is quite natural and uses a sequence of dynamic approximate membership data structures
of geometrically-increasing sizes. When viewing the sequence of inserted elements as consecutive
subsequences, where the ith subsequence consists of 2^i elements, at the beginning of the ith
subsequence we allocate a dynamic approximate membership data structure B_i with a false positive
rate ε_i = O(ε/i²). The elements of the ith subsequence are processed by B_i. A membership query
for an element x ∈ U is performed by invoking the membership algorithm of each of the existing
data structures B_i, and reporting Yes if any of them does. Clearly, the construction has no false
negatives, and its false positive rate is at most Σ_i ε_i ≤ ε.

By carefully instantiating the underlying B_i's with existing dynamic approximate membership
data structures, for any sequence of n insertions the data structure uses only (1 + o(1)) n log(1/ε) +
O(n log log n) bits of space, and insertions are performed in expected amortized constant time.
However, membership queries require time Θ(log n) after n insertions, as a separate membership
query is needed for each of the existing B_i's.^2 We refer the reader to Section 4 for more details.
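
The warm-up construction can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the class names are hypothetical, keys are assumed to be non-negative integers, each B_i is replaced by a simple stand-in that stores universal-hash values in a range of size 2^i/ε_i (in the spirit of the hashing technique of [CFG+78]), and the concrete choice ε_i = 6ε/(π²i²) is just one way to make the rates sum to at most ε.

```python
import math
import random


class FingerprintFilter:
    """Stand-in fixed-capacity filter: keeps hash values in a range of size capacity/eps."""

    P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to be integers below P

    def __init__(self, capacity, eps, rng):
        self.m = math.ceil(capacity / eps)
        self.a = rng.randrange(1, self.P)
        self.b = rng.randrange(self.P)
        self.values = set()

    def _h(self, x):
        return ((self.a * x + self.b) % self.P) % self.m

    def insert(self, x):
        self.values.add(self._h(x))

    def member(self, x):
        return self._h(x) in self.values


class GrowingFilter:
    """Filters B_1, B_2, ... where B_i has capacity 2^i and false positive rate eps_i."""

    def __init__(self, eps, seed=0):
        self.eps = eps
        self.rng = random.Random(seed)
        self.filters = []
        self.room = 0  # free slots left in the newest filter

    def insert(self, x):
        if self.room == 0:
            i = len(self.filters) + 1
            # eps_i = 6*eps / (pi^2 * i^2); these rates sum to at most eps over all i >= 1.
            eps_i = 6 * self.eps / (math.pi ** 2 * i ** 2)
            self.filters.append(FingerprintFilter(2 ** i, eps_i, self.rng))
            self.room = 2 ** i
        self.filters[-1].insert(x)
        self.room -= 1

    def member(self, x):
        # One query per existing filter, hence Theta(log n) work after n insertions.
        return any(f.member(x) for f in self.filters)


if __name__ == "__main__":
    g = GrowingFilter(0.05)
    for x in range(1000):
        g.insert(x)
    print(len(g.filters), all(g.member(x) for x in range(1000)))
```

The query loop over all existing filters is exactly the cost that the second construction avoids.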

Construction 2: Constant-time operations. Whereas our first construction is somewhat
naive, our second construction is significantly more subtle, supporting membership queries in
constant time in the worst case with high probability (while still enjoying expected amortized constant
insertion time as in our first construction). Moreover, we show that it can be "de-amortized"
to support also insertions in constant time in the worst case by increasing its space usage from
(1 + o(1)) n log(1/ε) + O(n log log n) bits to O(n log(1/ε) + n log log n) bits (with a rather small
leading constant).

Unlike our first construction, this construction consists of only one data structure at any point 
in time. This data structure is a dynamic dictionary (i.e., an exact representation of a set) that is 
used for storing a carefully chosen superset of the elements that were inserted so far. For describing 
the main ideas underlying this construction, we again view the sequence of inserted elements as
consecutive subsequences, where the ith subsequence consists of 2^i elements. The construction is
initialized by sampling a function h : U → {0,1}^ℓ from a pairwise independent collection of functions,
where ℓ ≥ ⌈log(1/ε)⌉ + log u + 2 (recall that ε is the required false positive rate, and that u is the
size of the universe of elements).

The basic idea is that for inserting an element x as part of the ith subsequence, we store in the
current dictionary D_i the value h_i(x) that is defined as the leftmost ℓ_i = ⌈log(1/ε)⌉ + i + 2 bits of h(x).
At the end of the ith subsequence, we transition from the current dictionary D_i to a newly allocated
dictionary D_{i+1}, and de-allocate the space used by D_i. The transition is performed as follows: As
D_i is a dictionary, we can enumerate all of its stored values, and for each such value y ∈ {0,1}^{ℓ_i} we
insert both y0 ∈ {0,1}^{ℓ_{i+1}} and y1 ∈ {0,1}^{ℓ_{i+1}} to the new dictionary D_{i+1}. Note that D_i stores ℓ_i-bit
values, and D_{i+1} stores ℓ_{i+1}-bit values. The key point is that at any point in time there is only one
dictionary D_i, and therefore any membership query requires executing only one such query: Given
an element x and given that the current dictionary is D_i, we execute a membership query for h_i(x)
in D_i. Therefore, the time for supporting membership queries is identical to that of the underlying
dictionaries.



^2 We note that these Θ(log n) membership queries can be executed in parallel.



This approach, however, needs to be refined as the number of stored values increases too fast.
To match our lower bound, we would like to argue that each dictionary D_i stores O(2^i) values. This
is not the case: Each of the 2^j elements that are inserted as part of the jth subsequence, for any
j ≤ i, "contributes" 2^{i−j} values to D_i, and therefore the number of values stored by D_{log n} would be
Θ(n log n) instead of O(n).
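
A short calculation makes the blow-up concrete. It assumes, as an indexing convention that the overview does not fix, that subsequences are numbered from 1, so that n = 2^{i+1} − 2 elements have been inserted when the ith subsequence ends.

```latex
\[
  \sum_{j=1}^{i}
    \underbrace{2^{j}}_{\text{elements of subsequence } j}
    \cdot
    \underbrace{2^{\,i-j}}_{\text{values each contributes to } D_i}
  \;=\; \sum_{j=1}^{i} 2^{i}
  \;=\; i \cdot 2^{i}
  \;=\; \Theta(n \log n),
\]
% since 2^i = (n+2)/2 and i = \log(n+2) - 1 under the assumed indexing.
```

Every subsequence ends up contributing the same number of values, 2^i, so the total grows with the number of subsequences rather than staying proportional to n.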

We resolve this difficulty as follows. For inserting an element x as part of the ith subsequence,
we store in D_i the pair (h_i(x), g_i(x)), where h_i(x) is defined as the leftmost ℓ_i = ⌈log(1/ε)⌉ + i + 2 bits
of h(x) (as before), and g_i(x) is defined as the next r = ⌈log log u⌉ output bits of h(x) (padded with
the symbol ⊥ when fewer than r such bits are available)^3. The transitioning from D_i to D_{i+1} is now
performed as follows: For each pair (y, a_1···a_r) ∈ {0,1}^{ℓ_i} × {0,1,⊥}^r that is stored in D_i we insert
to D_{i+1} either the pair (ya_1, a_2···a_r⊥) if a_1 ≠ ⊥ (using ya_1 as its key), or the two pairs (y0, a) and
(y1, a) if a_1 = ⊥ (using y0 and y1 as the respective keys). This way, each of the 2^j elements that
are inserted as part of the jth subsequence, for any j ≤ i, "contributes" only 2^{i−j−r} values to D_i,
which guarantees that each D_i stores only O(2^i) values. This, combined with a standard bucketing
argument, enables us to match our space lower bound of using (1 + o(1)) n log(1/ε) + O(n log log n)
bits for any sequence of n insertions. Note that this method has no false negatives, and we show
that our choice of parameters guarantees that the false positive rate is at most ε. Moreover, the
time for supporting membership queries is identical to that of the underlying dictionaries.
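
The insertion and transition rules above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the class name and its parameters are hypothetical, bit strings are held as Python strings with "_" standing for ⊥, a plain dict of sets replaces the succinct dictionaries D_i, the hash is a simple (a·x + b) mod p stand-in that is not exactly pairwise independent on bit prefixes, and r is derived from an assumed universe of 2^universe_bits integer keys.

```python
import math
import random

BOTTOM = "_"  # stands for the padding symbol (bottom)


class PrefixFilter:
    def __init__(self, eps, universe_bits=32, seed=0):
        self.base = math.ceil(math.log2(1 / eps)) + 2
        self.ell = self.base + universe_bits                  # total hash length
        self.r = max(1, math.ceil(math.log2(universe_bits)))  # r = ceil(log log u)
        rng = random.Random(seed)
        self.p = (1 << 89) - 1                                # Mersenne prime, 89 bits
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(self.p)
        self.i = 1        # index of the current subsequence
        self.count = 0    # elements inserted in the current subsequence
        self.table = {}   # key: ell_i-bit string, value: set of r-symbol suffixes

    def _hash_bits(self, x):
        v = (self.a * x + self.b) % self.p
        return format(v, "089b")[: self.ell]

    def _ell_i(self):
        return self.base + self.i

    def _transition(self):
        new = {}
        for y, suffixes in self.table.items():
            for s in suffixes:
                if s[0] != BOTTOM:
                    # Shift one known bit from the suffix into the key.
                    new.setdefault(y + s[0], set()).add(s[1:] + BOTTOM)
                else:
                    # No known bits left: keep both possible extensions.
                    new.setdefault(y + "0", set()).add(s)
                    new.setdefault(y + "1", set()).add(s)
        self.table = new
        self.i += 1
        self.count = 0

    def insert(self, x):
        if self.count == 2 ** self.i:
            self._transition()
        bits = self._hash_bits(x)
        k = self._ell_i()
        suffix = bits[k : k + self.r].ljust(self.r, BOTTOM)
        self.table.setdefault(bits[:k], set()).add(suffix)
        self.count += 1

    def member(self, x):
        # A single lookup, keyed by the current-length prefix of h(x).
        return self._hash_bits(x)[: self._ell_i()] in self.table


if __name__ == "__main__":
    f = PrefixFilter(1 / 16)
    for x in range(2000):
        f.insert(x)
    print(f.i, len(f.table), all(f.member(x) for x in range(2000)))
```

Because the first r transitions extend a key by the element's own next hash bit, an element only starts doubling its footprint after r transitions, which is what keeps the number of keys proportional to 2^i in this sketch.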

The construction enjoys a good amortized insertion time: Most insertions correspond to
standard insertions for the current D_i, while only a small number of insertions require transitioning from
D_i to D_{i+1}. Specifically, we show that if the underlying dictionaries support insertions in expected
amortized constant time, then so does our construction. Moreover, we also show that if the
underlying dictionaries offer a constant insertion time in the worst case with high probability, then our
construction can be modified to offer constant time insertions in the worst case with high probability.
This follows the de-amortization technique of Arbitman, Naor and Segev [ANS09, ANS10], and only
increases the space usage from (1 + o(1)) n log(1/ε) + O(n log log n) bits to O(n log(1/ε) + n log log n)
bits (with a rather small leading constant).

Finally, we also show that our construction can even support deletions, as long as no false 
positives are deleted. We refer the reader to Section 5 for more details. 

1.4 Paper Organization 

The remainder of this paper is organized as follows. In Section 2 we present some essential
preliminaries. In Section 3 we prove our lower bound. In Sections 4 and 5 we present our constructions.
Finally, in Section 6 we discuss various directions for future research. 

2 Preliminaries 

Notation. For an integer n ∈ ℕ we denote by [n] the set {1, ..., n}. For a random variable X we
denote by x ← X the process of sampling a value x according to the distribution of X. Similarly,
for a finite set S we denote by x ← S the process of sampling a value x according to the uniform
distribution over S.

Computational model. We consider the unit cost RAM model in which the elements are taken
from a universe of size u, and each element can be stored in a single word of length w = ⌈log u⌉
bits. Any operation in the standard instruction set can be executed in constant time on w-bit
operands. This includes addition, subtraction, bitwise Boolean operations, left and right bit shifts
by an arbitrary number of positions, and multiplication. The unit cost RAM model has been the
subject of much research, and is considered the standard model for analyzing the efficiency of data
structures (see, for example, [DP08, Hag98, HMP01, Mil99, PP08, RR03] and the references therein).

^3 Such a pair (h_i(x), g_i(x)) is inserted using h_i(x) as its key, for enabling constant-time membership queries.

k-Wise independent functions. A collection ℋ of functions h : U → V is k-wise independent
if for any distinct x_1, ..., x_k ∈ U and for any y_1, ..., y_k ∈ V it holds that

\[ \Pr_{h \leftarrow \mathcal{H}}\left[h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k\right] = \frac{1}{|V|^k}. \]

More generally, a collection ℋ is k-wise δ-dependent if for any distinct x_1, ..., x_k ∈ U the distribution
(h(x_1), ..., h(x_k)) where h is sampled from ℋ is δ-close in statistical distance to the uniform
distribution over V^k.
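
For k = 2 the definition can be checked exhaustively on a small example. The sketch below uses the classical family h_{a,b}(x) = (a·x + b) mod p over U = V = Z_p for a prime p; this particular family and the function names are illustrative choices made here, not the collection used later in the paper.

```python
from itertools import product


def make_family(p):
    """All functions h_{a,b}(x) = (a*x + b) mod p, represented by the pair (a, b)."""
    return [(a, b) for a in range(p) for b in range(p)]


def evaluate(h, x, p):
    a, b = h
    return (a * x + b) % p


def is_pairwise_independent(p):
    """Check Pr[h(x1) = y1 and h(x2) = y2] = 1/p^2 for all distinct x1, x2 and all y1, y2."""
    family = make_family(p)
    for x1, x2 in product(range(p), repeat=2):
        if x1 == x2:
            continue
        for y1, y2 in product(range(p), repeat=2):
            hits = sum(
                1
                for h in family
                if evaluate(h, x1, p) == y1 and evaluate(h, x2, p) == y2
            )
            # hits / |family| must equal 1 / p^2
            if hits * p * p != len(family):
                return False
    return True


if __name__ == "__main__":
    print(is_pairwise_independent(5))  # prime modulus
    print(is_pairwise_independent(4))  # composite modulus
```

The check succeeds for a prime modulus because the two linear equations in (a, b) then have exactly one solution; with a composite modulus such as 4 some pairs of target values are never hit.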

3 The Lower Bound: From Approximate Membership to Compression 

Let B be an approximate membership data structure for sets of unknown size, for a universe U of
size u with a false positive rate 0 < ε < 1. Our lower bound holds for any such data structure
that supports insertions and membership queries. We assume B has access to a read-only array of
random bits at no cost in space, allowing randomized data structures. Since we do not limit the
time complexity of the Membership algorithm, and all possible histories of the data structure can be
computed using the random bits array, we may without loss of generality assume that B answers
Yes on input x exactly when the current state of the data structure is consistent with some history
in which x was inserted using the current array of random bits.

In this section we prove a lower bound on the space usage of B even if the size of the sets
to be approximated is known to be in a certain interval (this only strengthens the lower bound). 
Specifically, we prove the following theorem: 

Theorem 3.1. Let B, U, and ε be as above, and let n ≤ εu be sufficiently large and 1/√n ≤ α ≤ 1.
If for any sequence of insertions of any length m such that αn ≤ m ≤ n, the data structure B uses
at most βm bits of space, then for any integer γ ≥ 2 it holds that

\[ \beta \ge \left(1 - \frac{1}{\gamma}\right)\left(\log(1/\varepsilon) + (1 - 9\varepsilon)\log\log_{\gamma}(1/\alpha) - \Theta(1)\right). \]

In particular, by setting α = 1/√n and γ = 2^{(log n)^η}, for some constant 0 < η < 1, we obtain the
lower bound of (1 − o(1)) n log(1/ε) + Ω(n log log n) bits that is stated in Theorem 1.1. We note that
asking that the data structure is space efficient only for sequences of at least αn elements can be
viewed as allowing the data structure to process the first αn elements in an off-line manner using
arbitrary space (which, again, only strengthens the lower bound). We also note that setting α = 1
corresponds to the case where the size n of the set is known in advance.

The proof of Theorem 3.1 consists of two parts. In the first part (see Section 3.1) we show 
that it suffices to prove the lower bound for deterministic data structures, where the probability of 
false positives is taken over the choice of a uniformly sampled element (instead of over the internal 
randomness of the data structure). This is a standard averaging argument showing that one can fix 
the randomness of any randomized data structure, without significantly increasing the false positive 
rate. In the second part (see Section 3.2), we then follow the overview discussed in Section 1.3 for 
proving the lower bound for deterministic data structures. 



3.1 From Randomized to Deterministic Approximate Membership 

For any (sufficiently long) string r ∈ {0,1}*, we denote by B_r the deterministic data structure
obtained by fixing r as B's internal randomness. In addition, for any sequence S ∈ U^n of n insertions
we denote by Ŝ_r ⊆ U the set of all elements on which the Membership algorithm of B_r outputs Yes
after processing the sequence S. We note that S ∈ U^n is an ordered sequence of (not necessarily
distinct) elements, while Ŝ_r is a set. When we refer to the elements of S we may abuse notation
and treat S as a set. Note that the fact that there are no false negatives guarantees that S ⊆ Ŝ_r for
any r ∈ {0,1}*. Finally, for each such r and S define μ(Ŝ_r) = |Ŝ_r|/u, and define

\[ \mathcal{S}_{r,\varepsilon} = \left\{ S \in U^n : \mu(\hat{S}_r) \le 4\varepsilon \right\}. \]



The following lemma uses an averaging argument and states that there exists a choice of r ∈
{0,1}* such that μ(Ŝ_r) is rather small for many sequences S (i.e., that the set 𝒮_{r,ε} consists of many
sequences).

Lemma 3.2. Let B, u, ε and n be as above. Then, there exists a string r* ∈ {0,1}* such that

\[ \left|\mathcal{S}_{r^*,\varepsilon}\right| \ge u^n/2. \]

Proof. The randomized data structure B has false positive rate at most ε, and therefore for any
sequence S ∈ U^n it holds that

\[ \mathbb{E}_{r \leftarrow \{0,1\}^*}\left[\mu(\hat{S}_r)\right] \le \varepsilon + n/u \le 2\varepsilon. \]

By Markov's inequality it holds that

\[ \Pr_{r \leftarrow \{0,1\}^*}\left[\mu(\hat{S}_r) > 4\varepsilon\right] \le \frac{1}{2}. \]

In particular, there exists an r* ∈ {0,1}* for which for at least 1/2 of all the sequences S ∈ U^n it
holds that μ(Ŝ_{r*}) ≤ 4ε. ■

3.2 A Compression Argument for Deterministic Approximate Membership 

From this point on we focus on the deterministic data structure B_{r*}, where r* ∈ {0,1}* is the
string provided by Lemma 3.2. In this part of the proof we show that the data structure
B_{r*} can be used to encode the sequences in a large subset of 𝒮 = 𝒮_{r*,ε}. Since Lemma 3.2 provides
a lower bound on the cardinality of 𝒮, it also provides a lower bound on the length of such an
encoding.

Let S be a sequence in 𝒮 and partition it into consecutive subsequences S = C_1, C_2, ... such that
each C_i consists of γ^i elements, where γ ≥ 2 is an integer. We define S_i to be the concatenation
of the first i subsequences, and n_i to be its length. In other words, S_i is the prefix of S of length
n_i = Σ_{j≤i} γ^j. We denote by Ŝ_i the set of elements on which B_{r*} answers Yes after processing S_i.
Observe that since there are no false negatives, Ŝ_i ⊆ Ŝ_{i+1},^4 and therefore
μ(Ŝ_i) ≤ μ(Ŝ_{i+1}) ≤ 4ε for every integer i.

Lemma 3.3. For any sequence S ∈ 𝒮 of length n, there exists an integer i such that |S_i| ∈ [αn, n]
and

\[ \mu(\hat{S}_i) - \mu(\hat{S}_{i-1}) \le \frac{4\varepsilon}{\log_{\gamma}(1/\alpha) - 2}. \]



(Footnote) Recall that, as stated above, we may without loss of generality assume that the membership
algorithm answers Yes on input $x$ exactly when the current state of the data structure is consistent with
some history in which $x$ was inserted using the current array of random bits.



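Proof. Let $j_1 = \lceil \log_\gamma(\alpha n(\gamma - 1) + 1) \rceil$ and $j_2 = \lfloor \log_\gamma(n(\gamma - 1)) \rfloor$. For every $j_1 \le i \le j_2$ it holds
that $n_i \in [\alpha n, n]$. Since $\mu(\hat{S}_{j_1}) \ge 0$ and $\mu(\hat{S}_{j_2}) \le 4\epsilon$, and since for all $i$ it holds that
$\mu(\hat{S}_i) \le \mu(\hat{S}_{i+1})$, there must be an $i \in [j_1, j_2]$ such that

$$\mu(\hat{S}_i) - \mu(\hat{S}_{i-1}) \le \frac{4\epsilon}{j_2 - j_1} \le \frac{4\epsilon}{\log_\gamma(1/\alpha) - 2}.$$ ■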
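The pigeonhole step in this proof can be checked on a toy example. The sketch below is only an illustration: the function name `smallest_increment` and the sample values are hypothetical, and exact fractions are used to avoid rounding issues. Given a nondecreasing list of measures indexed from $j_1$ to $j_2$, it returns the position and size of the smallest increment, which can never exceed the average increment.

```python
from fractions import Fraction

def smallest_increment(mu):
    """mu[t] plays the role of the measure at index j1 + t (nondecreasing).

    Returns (t, mu[t] - mu[t-1]) for the first smallest increment, t >= 1.
    """
    increments = [mu[t] - mu[t - 1] for t in range(1, len(mu))]
    best = min(range(len(increments)), key=increments.__getitem__)
    return best + 1, increments[best]

# Hypothetical measures that grow from 0 to 4*eps with eps = 1/100.
mu = [Fraction(0), Fraction(1, 100), Fraction(3, 100),
      Fraction(35, 1000), Fraction(4, 100)]
t, inc = smallest_increment(mu)
average = (mu[-1] - mu[0]) / (len(mu) - 1)   # plays the role of 4*eps / (j2 - j1)
print(t, inc, inc <= average)
```

The minimum over the $j_2 - j_1$ increments is at most their average, which is the only fact the proof uses.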



Fix a sequence $S$ of length $n$, let $i$ be the smallest integer that satisfies the condition in Lemma 3.3,
and let $k = |C_i \cap \hat{S}_{i-1}|$. That is, $k$ is the number of elements from the subsequence $C_i$ for which
the Membership algorithm already answers Yes right before the $i$th subsequence $C_i$ is processed by
the Insert algorithm. Observe that since the data structure is deterministic, $k = k(S)$ is completely
determined by the sequence $S$. We are interested in the case $k \le 9\epsilon|C_i|$. In the next lemma we show
that for most sequences in $\mathcal{S}$ this is indeed the case.

Lemma 3.4. It holds that

$$\left|\{S \in \mathcal{S} : k(S) \le 9\epsilon|C_i|\}\right| \ge \frac{u^n}{3}. \tag{3.1}$$

Proof. Consider a sequence $S$ which is uniformly sampled in $U^n$ one subsequence after the other.
We emphasize that we sample from $U^n$ in order to avoid the dependencies associated with sampling
from $\mathcal{S}$. Assume that each prefix $S_j$ is associated with an arbitrary set $\hat{S}_j$ with measure at most $4\epsilon$.
If it happens that $S_j$ is a prefix of some sequence in $\mathcal{S}$, then $\hat{S}_j$ is indeed defined as before to be the
set of positive replies. Otherwise $\hat{S}_j$ can be any subset of $U$ of measure at most $4\epsilon$.

Now, since the subsequence $C_j$ is sampled uniformly and independently of $\hat{S}_{j-1}$, it holds
that $\mathbb{E}[|C_j \cap \hat{S}_{j-1}|] \le 4\epsilon|C_j|$, and by a Chernoff bound it holds that
$\Pr[|C_j \cap \hat{S}_{j-1}| > 9\epsilon|C_j|] \le \exp(-|C_j|)$. Under our assumptions every $|C_j|$ under consideration is at
least a fixed positive power of $n$, so by the union bound over the at most $\log_\gamma n$ subsequences, with
probability at least $1 - 1/n$ all the $j$ for which $|C_j|$ is large enough satisfy $|C_j \cap \hat{S}_{j-1}| \le 9\epsilon|C_j|$.
Again, by the union bound we have

$$\left|\{S \in \mathcal{S} : k(S) \le 9\epsilon|C_i|\}\right| \ge \left(\frac{1}{2} - \frac{1}{n}\right)u^n,$$

from which the lemma follows for all $n$ sufficiently large. ■

Assume that after the insertion of $C_i$ the data structure uses $b_i$ bits of space. We now describe the
encoding itself for a given sequence $S$.

First write the number $i$ from Lemma 3.3, followed by an explicit uncompressed representation
of all items in the sequence $S$, except those of $C_i$. This requires at most $(n - c_i)\log u + \log\log n$
bits, where $c_i = |C_i|$. We will use the data structure in order to encode $C_i$ in a more compact form
as follows. Recall that $k$ items out of $C_i$ are in $\hat{S}_{i-1}$. We need at most $c_i$ bits to denote where in the
sequence these items are located. Next, we store the data structure itself using $b_i$ bits. We observe
that since the data structure is deterministic and we write all the elements other than $C_i$ explicitly,
the encoding thus far characterizes the set $\hat{S}_{i-1}$. Also, since the data structure itself is written,
the encoding so far characterizes the set $\hat{S}_i$. The remaining part of the encoding consists of the
elements of $C_i$ encoded relative to these two sets: We encode the $c_i - k$ elements in $\hat{S}_i \setminus \hat{S}_{i-1}$ using
$(c_i - k)\log((\mu(\hat{S}_i) - \mu(\hat{S}_{i-1}))u) + O(1)$ bits and the remaining $k$ elements using $k\log(\mu(\hat{S}_i)u) + O(1)$
bits. All in all the length of this part of the encoding is:

$$(c_i - k)\log\left((\mu(\hat{S}_i) - \mu(\hat{S}_{i-1}))u\right) + k\log\left(\mu(\hat{S}_i)u\right) + O(1). \tag{3.2}$$



By our choice of $i$ we have

$$\log\left(\mu(\hat{S}_i) - \mu(\hat{S}_{i-1})\right) \le \log(\epsilon) - \log\log_\gamma(1/\alpha) + O(1).$$

Plugging this into (3.2) and using the fact that $\mu(\hat{S}_i) \le 4\epsilon$, the length is at most

$$(c_i - k)\left(\log u + \log\epsilon - \log\log_\gamma(1/\alpha)\right) + k\log(\epsilon u) + O(c_i)
\le c_i\left(\log u + \log\epsilon - (1 - 9\epsilon)\log\log_\gamma(1/\alpha) + O(1)\right),$$

and the length of the remaining part of the encoding is at most

$$b_i + \log\log n + (n - c_i)\log u + c_i.$$

By (3.1), the total length of the encoding has to be at least $\log(u^n/3)$, so we have:

$$b_i + \log\log n + (n - c_i)\log u + c_i\left(\log u + \log\epsilon - (1 - 9\epsilon)\log\log_\gamma(1/\alpha) + O(1)\right) \ge n\log u - O(1),$$

which implies that

$$b_i \ge c_i\left(\log(1/\epsilon) + (1 - 9\epsilon)\log\log_\gamma(1/\alpha) - O(1)\right).$$

Finally, since $c_i = \gamma^i$ we have that $c_i = \left(n_i + \frac{\gamma}{\gamma - 1}\right) \cdot \frac{\gamma - 1}{\gamma}$ which, together with the assumption $\beta n_i \ge b_i$
in the statement of Theorem 3.1, completes the proof of Theorem 3.1.
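The closing identity for $c_i$ is a geometric-sum computation. The derivation below is a sketch under the assumption (not stated explicitly in the text above) that the sum defining $n_i$ starts at $j = 1$:

```latex
n_i \;=\; \sum_{j=1}^{i} \gamma^j \;=\; \frac{\gamma\,(\gamma^i - 1)}{\gamma - 1}
\quad\Longrightarrow\quad
c_i \;=\; \gamma^i \;=\; 1 + \frac{\gamma - 1}{\gamma}\, n_i
      \;=\; \Big( n_i + \frac{\gamma}{\gamma - 1} \Big) \cdot \frac{\gamma - 1}{\gamma}.
```

In particular $c_i \ge \frac{\gamma - 1}{\gamma}\, n_i$, so a lower bound on $b_i$ that is stated per element of $C_i$ translates into a lower bound per element of the whole prefix $S_i$, at the cost of the factor $(\gamma - 1)/\gamma$.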

4 Construction 1 (Warm-Up): Geometrically-Increasing Data Structures 

Our first construction is quite simple and natural and uses a sequence of dynamic approximate
membership data structures of geometrically-increasing sizes. When viewing the sequence of inserted
elements as consecutive subsequences, where the $i$th subsequence consists of $2^i$ elements, at the
beginning of the $i$th subsequence we allocate and initialize a dynamic approximate membership
data structure $B_i$ with a false positive rate $\epsilon_i = O(\epsilon/i^2)$. The elements of the $i$th subsequence are
processed by the insertion algorithm of the data structure $B_i$. A membership query for an element
$x \in U$ is performed by invoking the membership algorithm of each of the existing data structures
$B_i$, and reporting Yes if any of them does. Clearly, as the underlying data structures have no false
negatives, our construction has no false negatives. In addition, a union bound guarantees that
the false positive rate is at most $\sum_{i=1}^{\infty} \epsilon_i = \Theta(\epsilon\pi^2/6) \le \epsilon$ by appropriately adjusting the constants
in the choices of the $\epsilon_i$.
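The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the instantiation analyzed in this section: `FingerprintFilter` is a hypothetical stand-in for the underlying approximate membership structure (it stores truncated BLAKE2b fingerprints in a Python set), and the choice $\epsilon_i = 6\epsilon/(\pi^2 i^2)$ is one concrete way of fixing the constants so that the $\epsilon_i$ sum to $\epsilon$.

```python
import hashlib
import math

class FingerprintFilter:
    """Hypothetical stand-in for B_i: a set of fingerprints of fixed bit length."""

    def __init__(self, capacity, eps, salt):
        # capacity * 2^(-bits) <= eps bounds the collision probability
        # (treating the hash as random, which is an assumption of this sketch).
        self.bits = min(128, max(1, math.ceil(math.log2(capacity / eps))))
        self.salt = salt
        self.fingerprints = set()

    def _fp(self, x):
        digest = hashlib.blake2b(repr(x).encode(), digest_size=16,
                                 salt=self.salt).digest()
        return int.from_bytes(digest, "big") >> (128 - self.bits)

    def insert(self, x):
        self.fingerprints.add(self._fp(x))

    def query(self, x):
        return self._fp(x) in self.fingerprints

class GrowingFilter:
    """Construction 1: filters B_1, B_2, ... where B_i holds 2^i elements."""

    def __init__(self, eps):
        self.eps = eps
        self.filters = []
        self.free_slots = 0   # remaining room in the newest filter

    def insert(self, x):
        if self.free_slots == 0:
            i = len(self.filters) + 1
            eps_i = 6 * self.eps / (math.pi ** 2 * i ** 2)   # sums to eps over all i
            self.filters.append(
                FingerprintFilter(2 ** i, eps_i, i.to_bytes(8, "big")))
            self.free_slots = 2 ** i
        self.filters[-1].insert(x)
        self.free_slots -= 1

    def query(self, x):
        # One query per existing filter: this is the source of the log n query time.
        return any(f.query(x) for f in self.filters)

f = GrowingFilter(0.01)
for x in range(1000):
    f.insert(x)
print(len(f.filters), all(f.query(x) for x in range(1000)))
```

After 1000 insertions the sketch holds nine filters (capacities 2, 4, ..., 512), and every inserted element is reported as present because each filter has no false negatives.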

We can instantiate the $B_i$'s, for example, with the dynamic approximate membership data structure
resulting from the dynamic dictionary of Raman and Rao [RR03] (via the general dictionary-based
methodology described in Section 1.2). This dynamic approximate data structure supports
insertions in constant expected amortized time, membership queries in constant time in the worst
case, and its space consumption is $(1 + o(1))2^i\log(1/\epsilon_i)$ bits for any set of known size $2^i$ with a false
positive rate $\epsilon_i$. This guarantees that, for any number $n$ of elements, the number of bits used by
our construction after inserting any $n$ elements is at most

$$\begin{aligned}
&(1 + o(1))\,n \cdot \max_{1 \le i \le \lceil \log n \rceil}\left\{\log(1/\epsilon_i) + O(1)\right\} \\
&\qquad = (1 + o(1))\,n \cdot \max_{1 \le i \le \lceil \log n \rceil}\left\{\log(1/\epsilon) + \log(i^2) + O(1)\right\} \\
&\qquad = (1 + o(1))\,n\log(1/\epsilon) + O(n\log\log n).
\end{aligned}$$
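The chain of equalities above can be illustrated numerically. The sketch below makes two simplifying assumptions that are not part of the analysis: it charges exactly $\log_2(1/\epsilon_i)$ bits per stored element (ignoring the $1 + o(1)$ factor and the $O(1)$ term), and it fixes $\epsilon_i = 6\epsilon/(\pi^2 i^2)$. The function name `construction1_bits` is hypothetical.

```python
import math

def construction1_bits(n, eps):
    """Leading-term space after n insertions, and the number of filters used."""
    total, i, remaining = 0.0, 0, n
    while remaining > 0:
        i += 1
        eps_i = 6 * eps / (math.pi ** 2 * i ** 2)
        stored = min(2 ** i, remaining)          # elements held by B_i
        total += stored * math.log2(1 / eps_i)   # log(1/eps) + 2 log i + O(1) each
        remaining -= stored
    return total, i

n, eps = 10 ** 6, 0.01
total, k = construction1_bits(n, eps)
bound = n * (math.log2(1 / eps) + 2 * math.log2(k) + math.log2(math.pi ** 2 / 6))
print(k, total / n, bound / n)
```

Since $k$ is about $\log n$, the per-element overhead $2\log_2 k$ is the $O(\log\log n)$ term of the bound.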
Note, however, that membership queries require time $\Theta(\log n)$, since given an element $x$ we do
not know into which of the $B_i$ it might have been inserted. Therefore we need a separate membership
query for each of the $B_i$. This yields the following theorem:

Theorem 4.1. For any $0 < \epsilon < 1$ there exists a data structure for approximate membership for sets
of unknown sizes with the following properties:

1. The false positive rate is at most $\epsilon$.

2. For any integer $n$, the data structure uses at most $(1 + o(1))n\log(1/\epsilon) + O(n\log\log n)$ bits of
space after $n$ insertions.

3. Insertions take expected amortized constant time, and for any integer $n$ membership queries
are supported in $O(\log n)$ time after $n$ insertions.

5 Construction 2: Constant-Time Operations 

As in our first construction, when processing a sequence of elements we partition it into consecutive
subsequences, where the $i$th subsequence consists of $2^i$ elements. For every integer $i$ we denote the
$i$th subsequence by $S_i = x_{2^{i-1}} \cdots x_{2^i - 1}$, and, abusing notation, we also write $S_i$ for the set
$\{x_{2^{i-1}}, \ldots, x_{2^i - 1}\}$.

Let $\mathcal{H}$ be a pairwise independent collection of functions $h : U \to \{0,1\}^{\ell}$, where
$\ell \ge \lceil \log(1/\epsilon) \rceil + \log u + 2$ and $|U| = u$. For each $h \in \mathcal{H}$ and integer $i \in [\ell]$ we let
$h_i : U \to \{0,1\}^{\ell_i}$ be the leftmost $\ell_i = \lceil \log(1/\epsilon) \rceil + i + 2$ output bits of $h$, and let
$g_i : U \to \{0,1\}^r$ be the next $r = \lceil \log\log u \rceil$ output bits of $h$ (padded with the symbol $\perp$ when
fewer than $r$ such bits are available).

The basic construction. The data structure is initialized by sampling a function $h \leftarrow \mathcal{H}$. At
any point in time, when the $i$th subsequence $S_i$ is being processed, the data structure consists of a
dynamic dictionary $\mathcal{D}_i$. As discussed in Section 1.3, the insertion procedure operates in one out of
two possible modes, depending on whether or not the element that is currently being inserted is the
first element of its subsequence. We describe each of these modes separately.

• Mode 1. When the inserted element $x \in S_i$ is not the first of its subsequence, we store the
pair $(h_i(x), g_i(x))$ in the current dictionary $\mathcal{D}_i$ using $h_i(x)$ as its key.

• Mode 2. When the inserted element $x \in S_i$ is the first of its subsequence (i.e., $x = x_{2^{i-1}}$), we
transition from the current dictionary $\mathcal{D}_{i-1}$ to a new dictionary $\mathcal{D}_i$, deallocate the space used
by $\mathcal{D}_{i-1}$, and then proceed as in mode 1 above.

Specifically, the dictionary $\mathcal{D}_i$ is initialized for storing at most $2^{i+2}$ elements, each of length
$\ell_i + r$ bits. If $i > 1$ we initialize $\mathcal{D}_i$ by enumerating all pairs currently stored by $\mathcal{D}_{i-1}$, and
processing each such pair $(y, a_1 \cdots a_r) \in \{0,1\}^{\ell_{i-1}} \times \{0,1,\perp\}^r$ as follows: If $a_1 \ne \perp$, we insert
into $\mathcal{D}_i$ the pair $(y a_1, a_2 \cdots a_r \perp)$ using $y a_1$ as its key. Otherwise, we insert into $\mathcal{D}_i$ the two pairs
$(y0, a)$ and $(y1, a)$ using $y0$ and $y1$ as their keys, respectively.

Membership queries are naturally defined: Given an element $x \in U$, and given that the current
dictionary is $\mathcal{D}_i$ for some $i$, we query $\mathcal{D}_i$ with the key $h_i(x)$ to retrieve a pair of the form $(h_i(x), a)$
for some $a$. If such a pair is found we output Yes, and otherwise we output No.
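The two insertion modes and the query procedure can be sketched as follows. This is a toy illustration under several assumptions that are not part of the construction: BLAKE2b stands in for a function sampled from the pairwise independent family, a Python `dict` stands in for the dynamic dictionary, the character `"_"` stands for the padding symbol, and each key maps to a set of satellite strings so that colliding keys do not overwrite each other. The class name `UnknownSizeFilter` and its parameters are hypothetical, and the subsequences follow the index range $x_{2^{i-1}}, \ldots, x_{2^i - 1}$.

```python
import hashlib
import math

BOT = "_"   # stands for the padding symbol

class UnknownSizeFilter:
    def __init__(self, eps, log_u):
        self.k = math.ceil(math.log2(1 / eps))         # ceil(log(1/eps))
        self.r = max(1, math.ceil(math.log2(log_u)))   # r = ceil(log log u)
        self.ell = self.k + log_u + 2                  # output length of h (<= 256 here)
        self.i = 0        # index of the current dictionary D_i
        self.count = 0    # number of elements inserted so far
        self.table = {}   # D_i: key (bit string) -> set of satellite strings

    def _h(self, x):
        d = hashlib.blake2b(repr(x).encode(), digest_size=32).digest()
        return bin(int.from_bytes(d, "big"))[2:].zfill(256)[:self.ell]

    def _pair(self, x):
        bits = self._h(x)
        ell_i = self.k + self.i + 2
        sat = bits[ell_i:ell_i + self.r]
        return bits[:ell_i], sat + BOT * (self.r - len(sat))   # (h_i(x), g_i(x))

    def _transition(self):
        """Mode 2: rebuild D_{i+1} from the pairs stored in D_i."""
        new = {}
        for y, satellites in self.table.items():
            for a in satellites:
                if a[0] != BOT:      # extend the key by the next stored bit
                    new.setdefault(y + a[0], set()).add(a[1:] + BOT)
                else:                # no bits left: insert both extensions
                    new.setdefault(y + "0", set()).add(a)
                    new.setdefault(y + "1", set()).add(a)
        self.table = new
        self.i += 1

    def insert(self, x):
        self.count += 1
        if self.count == 2 ** self.i:   # x is the first element of its subsequence
            self._transition()
        key, sat = self._pair(x)        # Mode 1
        self.table.setdefault(key, set()).add(sat)

    def query(self, x):
        return self._pair(x)[0] in self.table

f = UnknownSizeFilter(eps=1 / 16, log_u=16)
for x in range(1000):
    f.insert(x)
print(f.i, all(f.query(x) for x in range(1000)))
```

The sketch is only meaningful while $i \le \log u$, since beyond that point the stand-in hash has no further output bits to compare against. Within that range every inserted element keeps a key that is a prefix of its hash value, which is why no false negatives arise.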
Dealing with failures. We note that a subtle point in the construction is that each of the
dictionaries $\mathcal{D}_1, \mathcal{D}_2, \ldots$ may have a certain failure probability. Using existing dictionaries, the failure
probability for each $\mathcal{D}_i$ can be made as small as any polynomial in $2^{-i}$. This means that whenever
$i = \Omega(\log u)$, the failure probability can be made polynomially small in $u$, but when $i = o(\log u)$ the
failure probability is rather large.

There are two standard methods for dealing with such large failure probabilities. The first is
to simply rebuild each $\mathcal{D}_i$ that fails. Even for small values of $i$, the expected number of failures is
typically a small constant, and thus we will be able to guarantee good expected performance. The
second is to group together into one dictionary the first $u^\delta$ elements, for an arbitrary small constant
$0 < \delta < 1$. This way, a union bound shows that no dictionary fails except with probability $u^{-c}$
for any pre-determined constant $c > 1$. For simplicity, in what follows we analyze our construction
assuming that at least $n \ge u^\delta$ elements are inserted, and that we group together the first $u^\delta$ elements.

Optimal space via bucketing. Note that the transitioning from each dictionary $\mathcal{D}_i$ to $\mathcal{D}_{i+1}$
requires storing both until all elements of $\mathcal{D}_i$ have been transitioned into $\mathcal{D}_{i+1}$ (as explained above).
This increases the space used by the data structure by a multiplicative constant factor. Using a
standard bucketing technique (see, for example, [DMadH90, DR09]) we reduce the space usage of the
construction when at least $n \ge u^\delta$ elements are inserted, for an arbitrary small constant $0 < \delta < 1$.

Specifically, we first hash the elements into $u^{1-\delta}$ buckets, and then apply our basic construction
in each bucket. To enable the data structure to gradually allocate more space, the data structures
in the buckets are interleaved word-wise: For every $i \in [u^{1-\delta}]$, the data structure of the $i$th bucket
resides in memory words whose location is equal to $i$ modulo $u^{1-\delta}$. This guarantees that if the
maximum space usage of the data structures in the buckets is $s_{\max}$ words, then the total space
required for the construction is $u^{1-\delta} \cdot s_{\max}$ words (and additional space can be easily allocated).

For any $u^\delta \le n \le u$ the hash functions of [DMadH90, DR09] split the elements quite evenly: each
bucket contains at most $(1 + o(1))n/u^{1-\delta}$ elements, except with a probability that is polynomially
small in $u$. Moreover, these functions can be evaluated in constant time. Applying our basic
construction in each bucket guarantees that the transitioning operation occurs in at most one bucket
at any point in time, and therefore the additional space that is required is proportional to the number
of elements in each bucket and not to the total number of elements.
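The word-wise interleaving can be illustrated with a one-line address computation. The sketch below assumes 0-based bucket indices and word offsets, and the helper name `word_address` is hypothetical: the word at offset `offset` of bucket `bucket` is placed at a global location congruent to `bucket` modulo the number of buckets.

```python
def word_address(bucket, offset, num_buckets):
    """Global word location of the offset-th word of the given bucket."""
    return bucket + offset * num_buckets

# With 4 buckets that each use at most s_max = 3 words, the 12 addresses
# are exactly 0..11, so the total space is num_buckets * s_max words.
num_buckets, s_max = 4, 3
layout = sorted(word_address(b, j, num_buckets)
                for b in range(num_buckets) for j in range(s_max))
print(layout)
```

Growing every bucket by one word only appends one more stripe of `num_buckets` words at the end of the array, which is what makes gradual allocation easy.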

Performance analysis. The following theorem is obtained by instantiating our construction with 
a sufficiently good construction of a dynamic dictionary. For example, the dynamic dictionary of 
Raman and Rao [RR03] is space optimal (up to additive lower-order terms), supports insertions in 
constant expected amortized time, and membership queries in constant time in the worst case. 

Theorem 5.1. For any $0 < \epsilon < 1$, integer $u$, and constant $c > 1$, there exists a data structure
for approximate membership for sets of unknown sizes from a universe of size $u$ with the following
properties:

1. The false positive rate is $\epsilon + u^{-c}$.

2. For any constant $0 < \delta < 1$ and $n \ge u^\delta$, the data structure uses at most $(1 + o(1))n\log(1/\epsilon) +
O(n\log\log n)$ bits of space after $n$ insertions.

3. Insertions take expected amortized constant time, and membership queries take constant time
in the worst case.

Proof. As discussed above, hashing the inserted elements into $u^{1-\delta}$ buckets results in a balanced
allocation up to additive lower order terms with all but a polynomially small probability in $u$.
Therefore, for simplicity, from this point on we focus on $n$ elements that are inserted into a single
bucket. We first prove that for every $i$, at most $2^{i+2}$ elements are inserted into the dictionary $\mathcal{D}_i$. Fix
an $i$, and partition the elements that are inserted into $\mathcal{D}_i$ into two disjoint sets: elements that correspond
to elements from $S_1, \ldots, S_{i-r}$, and elements that correspond to elements from $S_{i-r+1}, \ldots, S_i$. For
each element $x$ that belongs to some $S_j$, we observe that it contributes $2^{i-j-r}$ elements if $1 \le j \le i - r$,
and exactly one element if $i - r + 1 \le j \le i$. Therefore, the number of elements that are inserted
into $\mathcal{D}_i$ is

$$\sum_{j=1}^{i-r} |S_j| \cdot 2^{i-j-r} + \sum_{j=i-r+1}^{i} |S_j| = \sum_{j=1}^{i-r} 2^{i-r} + \sum_{j=i-r+1}^{i} 2^j \le 2^{i+2}.$$
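The counting bound above can be checked mechanically for small parameters. The sketch below assumes $|S_j| = 2^j$ (an upper bound on the length of the $j$th subsequence) and $r = \lceil \log\log u \rceil$ with $i \le \log u$. The function names are hypothetical.

```python
import math

def entries_in_dictionary(i, r):
    """Number of pairs in D_i when subsequence j has 2^j elements."""
    # Elements from S_1..S_{i-r} have used up their r satellite bits and
    # have been duplicated 2^(i-j-r) times each.
    old = sum(2 ** j * 2 ** (i - j - r) for j in range(1, i - r + 1))
    # Elements from S_{i-r+1}..S_i are still represented by a single pair.
    recent = sum(2 ** j for j in range(max(1, i - r + 1), i + 1))
    return old + recent

def satellite_bits(log_u):
    return math.ceil(math.log2(log_u))   # r = ceil(log log u)

print(entries_in_dictionary(5, 2), 2 ** (5 + 2))
```

The first sum equals $(i - r)\,2^{i-r}$, which is at most $2^i$ because $2^r \ge \log u \ge i$, and the second sum is below $2^{i+1}$.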

Now, for bounding the false positive rate, fix a sequence $x_1 \cdots x_j$ of inserted elements, an element
$x \notin \{x_1, \ldots, x_j\}$, and let $i$ be such that $2^{i-1} \le j \le 2^i - 1$. Then, the current state of the data
structure consists of a dictionary $\mathcal{D}_i$, and a query for $x$ initiates a membership query for the key $h_i(x)$.
Since at most $2^{i+2}$ keys were inserted so far into the dictionary $\mathcal{D}_i$, the pairwise independence of $\mathcal{H}$
guarantees that $x$ forms a collision with some existing element with probability at most $2^{i+2} \cdot 2^{-\ell_i} \le \epsilon$.
In addition, we assume that the constructions of all the $\mathcal{D}_i$ are successful except with probability
$u^{-c}$, and therefore the false positive rate is at most $\epsilon + u^{-c}$.

We now bound the space overhead. Assume that $2^{i-1} \le n \le 2^i - 1$ elements were inserted, and
that the current dictionary $\mathcal{D}_i$ is constructed using a dictionary that can store $n'$ elements from a
universe of size $u' = \mathrm{poly}(n')$ with $r$ bits of satellite data using space $(1 + o(1))n'(\log(u'/n') + r)$ bits
(e.g., [RR03] as discussed above). Then, the space utilized by $\mathcal{D}_i$ is at most

$$(1 + o(1))\,n\left(\log(2^{\ell_i}/2^i) + r\right) \le (1 + o(1))\,n\log(1/\epsilon) + \Theta(n\log\log u)
= (1 + o(1))\,n\log(1/\epsilon) + O(n\log\log n).$$

Finally, note that membership queries are supported in constant time, and that the expected amortized
insertion time is also constant (as in the underlying dictionary). ■

In the remainder of this section we describe two extensions of our construction. The first extension
shows how to enjoy constant-time insertions in the worst case by increasing the space usage
from $(1 + o(1))n\log(1/\epsilon) + O(n\log\log n)$ to $O(n\log(1/\epsilon) + n\log\log n)$. The second extension shows
how to support deletions (which have to be carefully defined).

Constant-time insertions in the worst case via de-amortization. As presented above using
the two different insertion modes, the construction enjoys a good amortized insertion time: Most
insertions correspond to mode 1 and are processed very fast, while only a small number of insertions
correspond to mode 2. The main observation is that if the underlying $\mathcal{D}_i$ offers a constant insertion
time in the worst case with high probability (e.g., as in [ANS10]), then our construction without
the bucketing can be de-amortized: Instead of initializing each $\mathcal{D}_i$ only when inserting $x_{2^{i-1}}$,
the total amount of work required for initializing $\mathcal{D}_i$ can be equally split among the insertions of
$x_{2^{i-1}}, \ldots, x_{2^i - 1}$. Specifically, on each such insertion, devote a constant number of additional steps
to the initialization of $\mathcal{D}_i$. As shown in the proof of Theorem 5.1, for every $i$ at most $2^{i+1}$ elements
are inserted into the dictionary $\mathcal{D}_{i-1}$. Therefore, the total amount of work (in the worst case)
required for initializing $\mathcal{D}_i$ is $O(2^i)$. We note that the idea of bucketing the elements that we used
above does not seem useful here. The reason is that it is no longer the case that a transition between
dictionaries occurs in at most one bucket at any point in time. Therefore, the space usage (even with
bucketing) would be $O(n\log(1/\epsilon) + n\log\log n)$ bits (with a rather small leading constant) instead
of $(1 + o(1))n\log(1/\epsilon) + O(n\log\log n)$ bits as in Theorem 5.1.

Supporting deletions. Note that for any approximate membership data structure it is impossible 
to detect if an attempt is made to delete a false positive. Thus, the data structure must put the 
burden on its user to ensure that deletions are applied only to elements that are in fact in the set 
(if this contract is broken, false negatives may arise). In the space analysis we will also assume that 
insertions are proper, i.e., an element may be inserted at most once. 

A well-known approach to supporting deletions [PPR05] is to store the multiset of signatures
rather than just the set of distinct signatures. A deletion of an element with signature $h(x)$ is
implemented by decreasing the multiplicity of $h(x)$ in the multiset by 1. However, there are
complications when trying to make this technique work in our setting. For example, in Construction 1, the
element to be deleted may be a false positive in one of the data structures $B_i$ and a "true positive"
in another data structure $B_j$. The problem is that there is no way to tell which is the false positive,
and if we remove $h(x)$ from $B_i$ a false negative will occur. A similar problem occurs in Construction
2, where it may not be possible to determine which signatures are to be deleted.

Our way around this problem is to abandon the idea of storing a multiset of signatures, but 
rather use a secondary dictionary data structure whenever we encounter identical signatures. In the 
following we describe how to augment Construction 2 with deletions. When inserting an element x 
we first check if it is a false positive of the existing set. Every false positive is inserted in the secondary 
data structure, while remaining elements are inserted in the primary data structure. Membership 
queries are extended to also look up the element in the secondary data structure, which has zero false 
positive rate. The deletion algorithm first checks if the element can be deleted from the secondary 
data structure. If not, its signature(s) need to be deleted from the primary structure. However, 
elements that were inserted when the set was smaller may be associated with a large number of 
signatures, formed by extending an original signature with all possible bit strings to form a set of 
possible signatures matching the current signature length. To allow efficient deletion, we extend 
the information that is stored with each key with the length of the original signature, and with a 
bit that can be used to indicate deletion. Deletion is performed by marking the lexicographically 
smallest signature in the set (i.e., the one extended with only zeros) as deleted. The membership 
procedure is then modified to compute this signature, and check whether it has been marked as 
deleted. To ensure that we do not use significant space for signatures of deleted keys, we run a 
background process that periodically checks if each signature can be removed from the set, spending 
constant time per update. In a similar way, we periodically check keys in the secondary structure 
to see if they remain false positives, or can be moved to the primary structure. 
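The interplay between the primary and the secondary structure can be sketched in a simplified setting. The code below is an illustration only: it uses a single fixed signature length (so it omits the signature extension, the deletion marks, and the background process described above), it assumes proper insertions and deletions, and the class name `DeletableFilter` and the toy signature `x % 8` are hypothetical.

```python
class DeletableFilter:
    def __init__(self, signature):
        self.signature = signature   # hypothetical signature function x -> int
        self.primary = set()         # signatures (may yield false positives)
        self.secondary = set()       # exact elements (zero false positive rate)

    def query(self, x):
        return x in self.secondary or self.signature(x) in self.primary

    def insert(self, x):
        # Proper insertion: x is not in the set, so a Yes answer here means
        # that x is currently a false positive and goes to the secondary structure.
        if self.query(x):
            self.secondary.add(x)
        else:
            self.primary.add(self.signature(x))

    def delete(self, x):
        # Proper deletion: x is in the set, hence in exactly one of the two.
        if x in self.secondary:
            self.secondary.remove(x)
        else:
            self.primary.remove(self.signature(x))

f = DeletableFilter(lambda x: x % 8)   # tiny signatures to force collisions
for x in (3, 11, 5):                   # 3 and 11 share the signature 3
    f.insert(x)
f.delete(3)
print(f.query(11), f.query(5), f.query(3))
```

Deleting 3 removes its signature from the primary structure, yet 11 is still reported because it was diverted to the secondary structure on insertion. This is how the approach avoids the false negatives that plain signature removal would cause.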

It is easy to see that the data structure will work correctly (under the assumption of proper
deletions and insertions). What is less obvious is how much extra space is needed for the secondary
structure. Observe that we may without loss of generality assume that the false positive rate is
at most $1/\log u$, since we allow a space overhead of $O(n\log\log n)$ bits, and $n \ge u^\delta$. This means
that the expected number of false positives in a set of $n$ elements is $O(n/\log u)$, so storing this set
requires just $O(n)$ bits in expectation. To ensure a high probability bound on the space usage, we
need a stronger hash function to compute the signatures. In particular, from [DGM+92] it follows
that using constant-degree polynomial hash functions we can ensure that the number of signature
collisions, corresponding to false positives, will be within a constant factor of the expectation with
probability $1 - u^{-c}$, for any desired constant $c$.
6 Directions for Future Research

Our work raises several fundamental directions for future research both from a theoretical perspective
and from a practical perspective. From a theoretical perspective, an interesting problem is to tighten
our lower bound by identifying the leading constant in the additive $\Omega(n\log\log n)$ term. In addition,
it would be interesting to explore whether our constructions can be improved by a data structure
that simultaneously enjoys the best of both worlds: space consumption of $(1 + o(1))n\log(1/\epsilon) +
O(n\log\log n)$ bits and constant-time operations in the worst case with high probability.

From a more practical perspective, while Bloom filters [Blo70] provide a practical solution in the
setting where an upper bound $n$ is known in advance, our constructions do not seem to enjoy the
same level of practicality in the setting where such an upper bound is not known in advance. Specifically,
our first construction supports membership queries in time $O(\log n)$, which may be too slow in
some applications, and our second construction suffers from non-trivial hidden constants due to our
de-amortization technique. It would be very interesting to design a practical solution that matches
our space lower bound.